This section is a walk through the pre-alignment sequence filtering in ReproPhylo. We will start by several preliminaries discussed in the previous sections:



In [1]:

    
from reprophylo import *

pj = unpickle_pj('outputs/my_project.pkpj', git=False)









    



DEBUG:Cloud:Log file (/home/amir/.picloud/cloud.log) opened

3.5.1 Filtering by sequence length or GC content

At this point we have record features belonging to the loci in our Project. We have to split them by locus:



In [2]:

    
pj.extract_by_locus()

With this done, we can display length and %GC distribution for each locus:



In [3]:

    
%matplotlib inline
pj.report_seq_stats()









    



Distribution Of Sequence Lengths
Distribution Of Sequence Statistic "Gc_Content"






    



/usr/lib/pymodules/python2.7/matplotlib/figure.py:371: UserWarning: matplotlib is currently using a non-GUI backend, so cannot show the figure
  "matplotlib is currently using a non-GUI backend, "

Now we'll exclude all the outliers:



In [4]:

    
# Define minima and maxima
gc_inliers = {
    '18s': [50,54],
    '28s': [57,67],
    'MT-CO1': [35,43]
}
len_inliers = {
    '18s': [1200,1800],
    '28s': [500,900],
    'MT-CO1': [500,1500]
}

# Apply to loci data
for locus in gc_inliers:
    # trim GC outliers
    pj.filter_by_gc_content(locus,
                            min_percent_gc=gc_inliers[locus][0],
                            max_percent_gc=gc_inliers[locus][1])
    
    # trim length outlier
    pj.filter_by_seq_length(locus,
                            min_length=len_inliers[locus][0],
                            max_length=len_inliers[locus][1])

We can now confirm that the filter worked:



In [5]:

    
pj.report_seq_stats()









    



Distribution Of Sequence Lengths
Distribution Of Sequence Statistic "Gc_Content"

3.5.2 Excluding and including

It is possible to exclude and include sequences by record id

3.5.2.1 Excluding

By default, excluding is done by starting with a full bin (all the sequences are included). In this case, since we have already filtered some sequences out, we need to start excluding from the current state and not from a full bin. Starting from a full bin by using the default setting start_from_max=True would undo the filtering by GC content and sequence length we have done above. As an example we will exclude JX177918.1 from the MT-CO1 Locus bin.



In [6]:

    
exclude = {'MT-CO1': ['JX177918.1']}
pj.exclude(start_from_max=False, **exclude)

The following line confirms that this record id is no longer in the MT-CO1 Locus bin.



In [7]:

    
any(['JX177918.1' in feature.id for feature in pj.records_by_locus['MT-CO1']])









    Out[7]:





False

3.5.2.2 Including

By default, including starts from empty bins, however here we want to keep the current state and only add one sequence:



In [8]:

    
include = {'MT-CO1': ['JX177918.1']}
pj.include(start_from_null=False, **include)

The following line confirms that this record was added back to the MT-CO1 Locus bin.



In [9]:

    
any(['JX177918.1' in feature.id for feature in pj.records_by_locus['MT-CO1']])









    Out[9]:





True



In [12]:

    
# Update the pickle file
pickle_pj(pj, 'outputs/my_project.pkpj')









    Out[12]:





'outputs/my_project.pkpj'

3.5.3 Quick reference



In [ ]:

    
# Split records to loci bins
pj.extract_by_locus()

# Show length and %GC distributions
%matplotlib inline
pj.report_seq_stats()

# Filter  by GC content
pj.filter_by_locus('LocusName',
                    min_percent_gc = 30,
                    max_percent_gc = 50)

# Filter  by sequence length
pj.filter_by_seq_length('LocusName',
                        min_length = 200,
                        max_length = 1000)

# Include or exclude records in the loci bins
records = {'LocusName1': ['recordid1','recordid2'],
            'LocusName2': ['recordid3','recordid4']}
pj.exclude(start_from_max=True, **records)

# or
pj.include(start_from_null=True, **records)



In [ ]: